Identifying collocations using cross-lingual association measures
Authors
Abstract
We introduce a simple and effective cross-lingual approach to identifying collocations. The approach is based on the observation that true collocations, which cannot be translated word for word, exhibit very different association scores before and after literal translation. Our experiments in Japanese demonstrate that the cross-lingual association measure can successfully exploit the combination of a bilingual dictionary and large monolingual corpora, outperforming monolingual association measures.
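The idea of comparing association scores before and after literal translation can be illustrated with a minimal sketch. The abstract does not give the exact measure, so everything specific here is an assumption: pointwise mutual information (PMI) as the association score, additive smoothing, taking the best-scoring literal translation, and all names (`pmi`, `corpus_pmi`, `cross_lingual_score`). The counts are invented toy values, not data from the paper.

```python
import math

def pmi(pair_count, w1_count, w2_count, total, smoothing=0.5):
    """PMI in bits; additive smoothing (an assumed choice) avoids log(0)."""
    p_xy = (pair_count + smoothing) / total
    p_x = (w1_count + smoothing) / total
    p_y = (w2_count + smoothing) / total
    return math.log2(p_xy / (p_x * p_y))

def corpus_pmi(pair, pair_counts, word_counts, total):
    """Look up counts for a word pair in one monolingual corpus and score it."""
    w1, w2 = pair
    return pmi(pair_counts.get(pair, 0),
               word_counts.get(w1, 0),
               word_counts.get(w2, 0),
               total)

def cross_lingual_score(pair, src, tgt, dictionary):
    """Source-side PMI minus the best PMI among word-for-word translations.

    src and tgt are (pair_counts, word_counts, total) tuples.
    Word order in the target language is ignored for simplicity.
    Returns None when the dictionary offers no literal translation.
    """
    w1, w2 = pair
    candidates = [(t1, t2)
                  for t1 in dictionary.get(w1, [])
                  for t2 in dictionary.get(w2, [])]
    if not candidates:
        return None
    src_score = corpus_pmi(pair, *src)
    tgt_score = max(corpus_pmi(c, *tgt) for c in candidates)
    return src_score - tgt_score

# Hypothetical toy counts (romanized Japanese source, English target).
src = ({("kaze", "hiku"): 50, ("hon", "yomu"): 50},
       {"kaze": 100, "hiku": 200, "hon": 100, "yomu": 200},
       10_000)
tgt = ({("book", "read"): 50},            # "cold" + "pull" never co-occur
       {"cold": 100, "pull": 200, "book": 100, "read": 200},
       10_000)
dictionary = {"kaze": ["cold"], "hiku": ["pull"],
              "hon": ["book"], "yomu": ["read"]}

collocation_score = cross_lingual_score(("kaze", "hiku"), src, tgt, dictionary)
free_score = cross_lingual_score(("hon", "yomu"), src, tgt, dictionary)
print(collocation_score > free_score)
```

In this toy setup, "kaze o hiku" ("catch a cold", literally "pull a cold") is strongly associated in the source corpus but its literal translation is not, so its score is large. The freely combined "hon o yomu" ("read a book") translates literally and its score stays near zero.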
Similar resources
A Corpus-based Analysis of Collocational Errors in the Iranian EFL Learners' Oral Production
Collocations are generally considered a problematic area for EFL learners. Iranian learners of English, like other EFL learners, face various problems in producing oral collocations. An analysis of learners' spoken interlanguage indicates both the scope of the problem and the need for learners to spend more time and energy on mastering collocations. The present study specifically f...
Retrieving Bilingual Verb-Noun Collocations by Integrating Cross-Language Category Hierarchies
This paper presents a method of retrieving bilingual collocations of a verb and its objective noun from cross-lingual documents with similar contents. Relevant documents are obtained by integrating cross-language hierarchies. The results showed a 15.1% improvement over the baseline non-hierarchy model, and a 6.0% improvement over use of relevant documents retrieved from a single hierarchy. Moreov...
Can we do better than frequency? A case study on extracting PP-verb collocations
We argue that lexical association measures (AMs) should be evaluated against a reference set of collocations manually extracted from the full candidate data, and that the notion of collocation needs to be precisely defined so that human collocativity judgments and experimental results are reproducible. We show that identification results achieved by particular AMs do not crucially depend on tex...
Empirical Implications on Lexical
An empirical study is presented showing how factors such as co-occurrence frequency, linguistic constraints in the candidate data and type of collocation to be identified influence the identification accuracy achieved, on the one hand, by a mere frequency-based approach and, on the other hand, by well known statistical association measures such as mutual information, Dice coefficient, relative entro...
Conditional Random Fields for Spanish Named Entity Recognition Using Unsupervised Features
Unsupervised features based on word representations such as word embeddings and word collocations have been shown to significantly improve supervised NER for English. In this work we investigate whether such unsupervised features can also boost supervised NER in Spanish. To do so, we use word representations and collocations as additional features in a linear chain Conditional Random Field (CRF) cla...